Exploratory Data Analysis (EDA)¶
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, employing data visualization.
This notebook includes:
- EDA: missingness, distributions, feature–feature correlations, feature–target correlations
- Per-Country EDA (DE/FR)
- Time Series analysis (global & per-country): rolling means, autocorrelation function (ACF), cross-correlation with target
- Simple feature engineering: simple imputation comparison, simple selection, mutual information(MI), principal component analysis (PCA)
1. Configuration and Preparing¶
In [1]:
# ==== Configuration ====
DATA_DIR = "../data/"
X_TRAIN_PATH = f"{DATA_DIR}X_train_NHkHMNU.csv"
Y_TRAIN_PATH = f"{DATA_DIR}y_train_ZAN5mwg.csv"
# Time series analysis config
ROLL_WINDOWS = [7, 14, 30]
RANDOM_STATE = 47
In [2]:
# ==== Imports ====
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import warnings
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold, mutual_info_regression
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
import seaborn as sns
warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = (8, 5) # one chart per figure
In [3]:
# ==== 1) Load data & basic checks ====
X = pd.read_csv(X_TRAIN_PATH)
y = pd.read_csv(Y_TRAIN_PATH)
assert "ID" in X.columns and "ID" in y.columns, "ID column must exist in both X and y"
df = X.merge(y, on="ID", how="left")
print("Shapes -> X:", X.shape, " y:", y.shape, " merged:", df.shape)
print("Columns:", list(df.columns))
display(df.head())
Shapes -> X: (1494, 35) y: (1494, 2) merged: (1494, 36) Columns: ['ID', 'DAY_ID', 'COUNTRY', 'DE_CONSUMPTION', 'FR_CONSUMPTION', 'DE_FR_EXCHANGE', 'FR_DE_EXCHANGE', 'DE_NET_EXPORT', 'FR_NET_EXPORT', 'DE_NET_IMPORT', 'FR_NET_IMPORT', 'DE_GAS', 'FR_GAS', 'DE_COAL', 'FR_COAL', 'DE_HYDRO', 'FR_HYDRO', 'DE_NUCLEAR', 'FR_NUCLEAR', 'DE_SOLAR', 'FR_SOLAR', 'DE_WINDPOW', 'FR_WINDPOW', 'DE_LIGNITE', 'DE_RESIDUAL_LOAD', 'FR_RESIDUAL_LOAD', 'DE_RAIN', 'FR_RAIN', 'DE_WIND', 'FR_WIND', 'DE_TEMP', 'FR_TEMP', 'GAS_RET', 'COAL_RET', 'CARBON_RET', 'TARGET']
| ID | DAY_ID | COUNTRY | DE_CONSUMPTION | FR_CONSUMPTION | DE_FR_EXCHANGE | FR_DE_EXCHANGE | DE_NET_EXPORT | FR_NET_EXPORT | DE_NET_IMPORT | ... | DE_RAIN | FR_RAIN | DE_WIND | FR_WIND | DE_TEMP | FR_TEMP | GAS_RET | COAL_RET | CARBON_RET | TARGET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1054 | 206 | FR | 0.210099 | -0.427458 | -0.606523 | 0.606523 | NaN | 0.692860 | NaN | ... | -0.172680 | -0.556356 | -0.790823 | -0.283160 | -1.069070 | -0.063404 | 0.339041 | 0.124552 | -0.002445 | 0.028313 |
| 1 | 2049 | 501 | FR | -0.022399 | -1.003452 | -0.022063 | 0.022063 | -0.573520 | -1.130838 | 0.573520 | ... | -1.240300 | -0.770457 | 1.522331 | 0.828412 | 0.437419 | 1.831241 | -0.659091 | 0.047114 | -0.490365 | -0.112516 |
| 2 | 1924 | 687 | FR | 1.395035 | 1.978665 | 1.021305 | -1.021305 | -0.622021 | -1.682587 | 0.622021 | ... | -0.480700 | -0.313338 | 0.431134 | 0.487608 | 0.684884 | 0.114836 | 0.535974 | 0.743338 | 0.204952 | -0.180840 |
| 3 | 297 | 720 | DE | -0.983324 | -0.849198 | -0.839586 | 0.839586 | -0.270870 | 0.563230 | 0.270870 | ... | -1.114838 | -0.507570 | -0.499409 | -0.236249 | 0.350938 | -0.417514 | 0.911652 | -0.296168 | 1.073948 | -0.260356 |
| 4 | 1101 | 818 | FR | 0.143807 | -0.617038 | -0.924990 | 0.924990 | NaN | 0.990324 | NaN | ... | -0.541465 | -0.424550 | -1.088158 | -1.011560 | 0.614338 | 0.729495 | 0.245109 | 1.526606 | 2.614378 | -0.071733 |
5 rows × 36 columns
In [4]:
print("\nCOUNTRY value counts:")
if "COUNTRY" in df.columns:
print(df["COUNTRY"].value_counts(dropna=False))
else:
print("No COUNTRY column found.")
NON_FEATURE = {"ID", "COUNTRY", "TARGET"}
num_cols = [c for c in df.columns if c not in NON_FEATURE and pd.api.types.is_numeric_dtype(df[c])]
print("Numeric feature count:", len(num_cols))
COUNTRY value counts: COUNTRY FR 851 DE 643 Name: count, dtype: int64 Numeric feature count: 33
2. Missing Data Profiling¶
In [5]:
missing_counts = df.isna().sum().sort_values(ascending=False)
missing_pct = (missing_counts / len(df)).sort_values(ascending=False)
missing_df = pd.DataFrame({"missing_count": missing_counts, "missing_pct": missing_pct})
display(missing_df)
| missing_count | missing_pct | |
|---|---|---|
| DE_NET_IMPORT | 124 | 0.082999 |
| DE_NET_EXPORT | 124 | 0.082999 |
| FR_TEMP | 94 | 0.062918 |
| DE_TEMP | 94 | 0.062918 |
| DE_WIND | 94 | 0.062918 |
| FR_WIND | 94 | 0.062918 |
| FR_RAIN | 94 | 0.062918 |
| DE_RAIN | 94 | 0.062918 |
| FR_NET_EXPORT | 70 | 0.046854 |
| FR_NET_IMPORT | 70 | 0.046854 |
| DE_FR_EXCHANGE | 25 | 0.016734 |
| FR_DE_EXCHANGE | 25 | 0.016734 |
| DE_CONSUMPTION | 0 | 0.000000 |
| COUNTRY | 0 | 0.000000 |
| DAY_ID | 0 | 0.000000 |
| ID | 0 | 0.000000 |
| FR_GAS | 0 | 0.000000 |
| DE_COAL | 0 | 0.000000 |
| FR_CONSUMPTION | 0 | 0.000000 |
| DE_GAS | 0 | 0.000000 |
| DE_SOLAR | 0 | 0.000000 |
| FR_NUCLEAR | 0 | 0.000000 |
| DE_NUCLEAR | 0 | 0.000000 |
| FR_HYDRO | 0 | 0.000000 |
| DE_HYDRO | 0 | 0.000000 |
| FR_COAL | 0 | 0.000000 |
| FR_SOLAR | 0 | 0.000000 |
| DE_WINDPOW | 0 | 0.000000 |
| DE_RESIDUAL_LOAD | 0 | 0.000000 |
| FR_RESIDUAL_LOAD | 0 | 0.000000 |
| FR_WINDPOW | 0 | 0.000000 |
| DE_LIGNITE | 0 | 0.000000 |
| GAS_RET | 0 | 0.000000 |
| COAL_RET | 0 | 0.000000 |
| CARBON_RET | 0 | 0.000000 |
| TARGET | 0 | 0.000000 |
In [6]:
# Plot top missing features — single figure
top_miss = missing_df.head((missing_df['missing_count'] > 0).sum())
plt.figure()
plt.subplot(2, 1, 1)
plt.bar(top_miss.index.astype(str), top_miss["missing_pct"].values)
plt.xticks(rotation=90)
plt.ylabel("Missing %")
plt.title("Top Missing Features")
plt.tight_layout()
plt.show()
3. Check Outlier Data¶
In [7]:
plots = [("TARGET", y["TARGET"])] + [(c, X[c]) for c in num_cols[1:-1]]
K = len(plots)
N = math.ceil(math.sqrt(K))
M = math.ceil(K / N)
fig, axes = plt.subplots(M, N, figsize=(N*4, M*3), squeeze=False)
axes = axes.ravel()
for i, (name, s) in enumerate(plots):
sns.boxplot(x=s, ax=axes[i])
axes[i].set_title(f"Boxplot of {name}")
plt.tight_layout()
plt.show()
4. Distributions of Features¶
In [8]:
plot_cols = num_cols[1:]
for c in plot_cols:
plt.figure()
plt.subplot(2, 1, 1)
if "COUNTRY" in df.columns:
for k, g in df.groupby("COUNTRY"):
g[c].plot(kind="hist", bins=40, alpha=0.5, label=str(k))
df[c].plot(kind="hist", bins=40, alpha=0.5, label='DE&FR')
plt.legend()
else:
df[c].plot(kind="hist", bins=40, alpha=0.7, shade=True)
plt.xlabel(c)
plt.title(f"Distribution of {c}")
plt.tight_layout()
plt.show()
5. Feature-Feature Correlations¶
In [9]:
sns.PairGrid(X.merge(y, on='ID', how='left')).map(plt.scatter,s=5)
Out[9]:
<seaborn.axisgrid.PairGrid at 0x2534b2da560>